Cycle analysis

Sample workflow for cell cycle analysis of E. coli grown in LB (Dataset D1).

Please note: the scVI method model.get_normalized_expression() involves random sampling, performing it multiple times using the same model/data will give slightly different results. When originally running this code, we unfortunately did not set a seed, making it hard to precisely reproduce the results. Therefore, running this notebook will produce slightly different results from those reported in the paper (we do not anticipate that this will change any of the conclusions drawn in the paper). We have included the original outputs used in the paper within the samples folder.

Load in data. Here, we are going to load in the same scVI model used in the paper (but you can load in the output model from initial_processing.ipynb).

Load in the GFF:

1. Deriving cell angles

We use the UMAP-based approach to derive cell angles (cell cycle states).

First, get the scVI-smoothed data and log-transform and scale it:

Do UMAP analysis. Let's start by doing it on the unbinned data. As we see, we get a rather formless blob from this:

We can remove a lot of the noise by binning into 100 kb bins along the chromosome (note this is the same binning approach used to create the chromsome-wide correlation plots):

Use these functions to create expression averaged in 100 kb bins (we do it on the scaled data to allow averaging of expression of genes at very different abundance levels):

Now fit a UMAP. If you want to reproduce the exact embeddings from the paper, these are available in samples/lb_umap_embeddings.txt).

Now we can use this plot to calculate the cell angles:

Again, there will be slight numerical differences between cell angles calculated in this notebook and cell angles used in the paper. The original cell angles are included in samples/lb_cell_angles.txt.

As a point of interest, let's compare the new cell angles from the cell angles in the paper as a density plot:

Note that there is clearly a very strong, linear overlap between the two. However, the orientation is different (the relationship is inverse). This is because the orientiation of axes on the UMAP is arbitary. We always check the relationship between angle and expression and flip if needed (see below). Also note that the phase is slightly different (see the top right corner of the plot), arising because the initial zero cell angle is arbitrary. We standardize this later.

We can also project these angles onto the unbinned UMAP. This shows that the binning helps to form the wheel by reducing noise, but actually the ordering of the cells is still largely conserved even in the unbinned "blob":

Finally, we can bin gene expression by cell angle to get our 100 bins used throughout the paper:

Now plot this. Note that initially the angles are the reverse of that seen in the paper. As explained above, this is because the orientation of UMAP is arbitrary (and highly sensitive to initiation conditions).

So we just flip the direction of the angles:

Then plot again:

We save the expression matrix as follows.

For the sake of reproducibility, again the angle binned expression matrix used in the paper is available at samples/lb_binned_expression_scvi.txt.

2. Deriving observed and predicted gene angles

To get gene angles, we just take the transpose of the binned expression matrix above and do PCA on it. Gene angles are the angles relative to the origin and PCs 1 and 2.

The next stage is to do the regression analysis for prediction of gene angles based on the origin distance. We do this externally in R using the Rstan package. Therefore, we need to save the data. We scale both gene angles and origin distances from -pi to pi, which allows us to treat both as angles in radians within the model:

The next bit needs to be run in R studio using the script origin_angle_circular_model.R. For the training data, fitted model in the paper and predicted angles, see:

There are also a few of important values we get from this:

Finally, we can load back in the predicted gene angles and superimpose these on the plot above:

3. Developing a prediction of cell cycle gene expression based on origin position:

We do this by first developing a matrix of expression binned 1) by cell angle, as above then 2) by gene angle. We then have to convert this into long format to provide training data for the regression model in sklearn:

We can save this comparison (the exact matrix used in the paper is at samples/cell_angle_gene_angle_relationship.txt):

Now we build the regression model. We make features based on the sine and cosine of each angle, and polynomial interactions between each. To do this, we use the PolynomialFeatures function:

This makes 71 features in total, making this a very flexible model. Next, we fit the regression model. We use a bit of ridge (L2) regularization to stop this model from over-fitting the data:

Finally, we use this to predict gene expression for the full set of genes. This becomes essentially a smoothed version of the plot above:

Now we can check how good a job this does. Firstly, the correlation between predicted and observed data is fairly good:

We can also assess how well this captures the global replication effect by looking at gene-gene correlations at the predicted and observed angle-binned expression matrices, and the difference between the two:

What this shows is that the predicted data are essentially a smoothed version of the observed data. The corrected data (observed - predicted) shows that the correlation is almost eliminated. This suggests that our model captures the majority of the chromosome position-dependent signal.

The replication-predicted gene expression matrix is saved as follows (note - the matrix used in the paper is available at samples/lb_binned_expression_predicted.txt):

4. Identifying variable genes

We describe a method in the paper to identify genes based on 1) relatively high variability within the cell cycle and 2) divergence from replication-predicted expectations. This can be can be carried out as follows:

Identifying highly variable genes

First, we need a course-grained average of expression within the cell cycle (here we use the raw counts instead of the scVI-smoothed data).

In order to log-transform the data, we need to remove genes that have zeros still in the data after course-grained binning:

Now log-transform the binned data and calculate the mean and variance of these:

Finally, do a LOWESS regression to fit a mean-variance relationship. This is based on the approach in Seurat used for variable feature selection. This fit avoids over-selection of low-expressed genes, that tend to have higher log-variance.

Now calculate a ratio of observed log-variance compared to the LOWESS fit. We set a standard threshold of 1.3, with a log-mean threshold cutoff of -10.

Calculate the fraction of genes classed as highly variable (here, 22%):

Next, we calculate a deviance metric to distinguish divergent and non-divergent genes. This is simply the standard deviation of cell cycle expression after correcting for the replication effect:

We class as divergent genes that are both highly variable and score above the threshold in the divergence score. In the paper we only include genes that meet the criteria for both replicates but here we can determine it for just one replicate:

Here, 4.6% of genes fit these criteria. As we discover in the paper, many genes that are non-divergent fit a canonical gene dosage-dependent pattern of expression, whereas the divergent genes exhibit dynamics that arise due to other mechanisms. However, while useful for initial characterization, this divergence score likely under-estimates the number of genes that diverge from a gene-dosage pattern because it works with scaled data. Therefore, while it can account for divergences due to shape, delay etc., it does not account for amplitudes that are different from expectations.

5. Defining a zero angle to standardize cell cycle expression

We want to define a zero angle that allows us to compare across different replicates. While this standardization is the primary purpose, we want to give this some biological meaning. Therefore, we chose to take the predicted minimum expression of an imaginary gene at the origin of replication. This is because the minimum is likely to correspond approximately with the replication time (since dosage-dependent genes will increase in expression upon their replication).

First, predict cell cycle expression for a gene at the origin (the predicted minimum is taken from the Rstan analysis):

Now determine the minimum cell angle bin:

The zero angle used in the paper is 62.9596°, reflecting the fact that the new angle estimates are slightly out of phase with the ones used in the paper.

We can save these data for downstream analysis, particularly the predicted and observed gene angles. The ones used in the paper itself are available at samples/lb_gene_information.txt.